CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data

https://arxiv.org/abs/1911.00359

https://aclanthology.org/2020.lrec-1.494/

In this paper, we describe an automatic pipeline to extract massive high-quality monolingual datasets from Common Crawl for a variety of languages.

Our pipeline follows the data processing introduced in fastText (Mikolov et al., 2017; Grave et al., 2018), which deduplicates documents and identifies their language.
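
The two steps named above, deduplication and language identification, can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes exact deduplication on a SHA-1 hash of the normalized text, and it takes the language identifier as a caller-supplied function. The names `normalize`, `dedup_and_tag`, and `toy_identify_language` are hypothetical, and the toy identifier is a placeholder for a real classifier such as a fastText language-ID model.

```python
import hashlib
import unicodedata
from typing import Callable, Iterable, Iterator, Tuple


def normalize(text: str) -> str:
    """Apply NFKC normalization, lowercase, and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    return " ".join(text.split())


def dedup_and_tag(
    docs: Iterable[str],
    identify_language: Callable[[str], str],
) -> Iterator[Tuple[str, str]]:
    """Yield (language, document) pairs, skipping exact duplicates.

    Assumption: two documents are duplicates when their normalized
    text hashes to the same SHA-1 digest.
    """
    seen = set()
    for doc in docs:
        digest = hashlib.sha1(normalize(doc).encode("utf-8")).digest()
        if digest in seen:
            continue  # an identical normalized document was already emitted
        seen.add(digest)
        yield identify_language(doc), doc


def toy_identify_language(text: str) -> str:
    """Hypothetical stand-in for a real language-ID model (illustration only)."""
    french_markers = {"le", "la", "les", "et", "est"}
    words = set(normalize(text).split())
    return "fr" if words & french_markers else "en"


if __name__ == "__main__":
    sample = ["Hello  world", "hello world", "Le chat est noir"]
    for lang, doc in dedup_and_tag(sample, toy_identify_language):
        print(lang, repr(doc))
```

Passing the identifier in as a function keeps the sketch runnable without downloading a model. In practice a trained classifier would replace `toy_identify_language`.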